It is now possible to take several [gen]omic measurements simultaneously on the same sample.
Cancer genome projects have been at the forefront of this trend and have faced the challenge of integrating these diverse data types [1,2], including RNA transcriptional levels, genotype variation, DNA copy number variation, and epigenetic marks.
Leary et al. (2008), PNAS: Integrated analysis of homozygous deletions, focal amplifications, and sequence alterations in breast and colorectal cancers.
Gene Sets & Gene Set Analysis
Integration may be helped by annotated collections of gene sets capturing established knowledge about biological processes and pathways.
Because one can make inferences about a given gene set using several different genomic data types, gene set analysis provides a direct and biologically motivated approach to analyzing multiple data types in an integrated way.
Individual features (genes, proteins, miRNAs) may each contribute little to the difference between phenotypes, yet together they may contribute a great deal.
It is not uncommon for similar studies to report nonintersecting lists of “top genes”.
When selecting different types of features from the same study, connecting them is a complex task.
Gene Set Analysis
Instead of ‘list of genes’ think about ‘list of gene sets’
Gene sets encompass a larger amount of biological information, which helps make results more interpretable.
Information on the gene set level is comparable across different types of measurements (different platforms)
The multiple testing issue is attenuated: we will usually (not always…) have fewer sets than individual genes.
Same biological mechanisms can manifest in different parts of the pathway and via different alterations in different subjects (!!!)
Overrepresentation Analysis (OR Analysis)
Image
OR Analysis Example
Image
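The two-by-two table behind an overrepresentation analysis can be sketched in a few lines. The example below is a minimal sketch, not the analysis shown on the slide: it computes the exact upper-tail hypergeometric probability (the one-sided Fisher exact test), and both the function name `ora_p_value` and all the counts are made up for illustration.

```python
from math import comb

def ora_p_value(n_universe, n_de, set_size, n_overlap):
    """Upper-tail hypergeometric p-value: P(overlap >= n_overlap).

    n_universe: number of genes measured
    n_de:       number of genes called differentially expressed
    set_size:   number of genes in the gene set
    n_overlap:  genes that are both in the set and differentially expressed
    """
    max_overlap = min(n_de, set_size)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(set_size, k) * comb(n_universe - set_size, n_de - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Made-up counts: 10,000 genes measured, 200 called differentially expressed,
# a gene set of 50 genes, 8 of which are in the differentially expressed list.
p = ora_p_value(n_universe=10_000, n_de=200, set_size=50, n_overlap=8)
print(f"expected overlap by chance: {200 * 50 / 10_000:.1f}")
print(f"ORA p-value: {p:.3g}")
```

With these invented counts the set contains far more differentially expressed genes than the single gene expected by chance, so the p-value is small.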
Drawbacks of OR Analysis
A short list of differentially expressed genes makes the counts in the two-by-two table small, so the approximation used to obtain the p-values is poor.
The definition of differentially expressed is somewhat arbitrary. We picked a false discovery rate of 5%, but we could have picked 10%, 25%, or some other number.
There is no natural separation between the two groups: wherever the cutoff falls, a gene on one side of it is in the list of differentially expressed genes, while the gene right next to it is not.
There are alternatives
Gene Set Enrichment Analysis. Motivation
Imagine we are interested in another gene set, chromosome Xp11.
This particular gene set had only one out of the many genes in it called differentially-expressed.
However, when we look at the distribution of t-statistics, or the distribution of the effect size, we note that there’s a slight shift to the left.
GSEA provides methods and tools to compute summary statistics that can be used to summarize effects such as this.
Looking at the ChrX p11 gene set
Image
Gene Set Analysis in more detail
Image
A two-stage approach for GSEA
Stage I: compute some gene to phenotype association scores (say, t-values) and rank genes according to these values.
Stage II: check whether the distribution of the ranks is different in a given set vs
the set formed by the rest of the genes (‘competitive null’).
or vs the distribution of the ranks in the same set when there is no association with the phenotype (‘self-contained null’).
Infer enriched sets, say by ranking sets according to the outcome of the Mann-Whitney or Wilcoxon test.
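Stage I can be sketched as follows. This is a minimal illustration under stated assumptions: the gene names and expression values are invented, and the Welch t-statistic is just one possible choice of gene-to-phenotype association score.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch two-sample t-statistic for one gene (unequal variances)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical expression values: gene -> (cases, controls)
expression = {
    "geneA": ([5.1, 5.4, 5.0, 5.3], [6.2, 6.0, 6.4, 6.1]),
    "geneB": ([7.0, 7.2, 6.9, 7.1], [7.1, 6.8, 7.0, 7.2]),
    "geneC": ([3.9, 4.2, 4.0, 4.1], [3.1, 3.3, 3.0, 3.2]),
}

# Stage I: one association score per gene, then rank genes by that score.
t_scores = {
    gene: welch_t(cases, controls)
    for gene, (cases, controls) in expression.items()
}
ranked_genes = sorted(t_scores, key=t_scores.get)  # most negative t first

for gene in ranked_genes:
    print(f"{gene}\t{t_scores[gene]:+.2f}")
```

The ranked scores are the input to Stage II, where the ranks of the genes in a set are compared against a null distribution.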
The decision about whether a gene set is over- or underrepresented is based on a Wilcoxon test
using the t-test scores as the response variable
and the presence/absence of a gene in each set as grouping factor.
Gene Set 1 : QCAT
===================
Test for UnderExpression : 1.803834e-06
Gene Set 2 : QCMAT
====================
Test for UnderExpression : 1.569614e-07
Gene Set 3 : GRIT1
====================
Test for UnderExpression : 4.503076e-13
Gene Set 8 : IRRAT
====================
Test for UnderExpression : 0.0004002614
Gene Set 14 : IRITD5
======================
Test for UnderExpression : 3.870342e-05
Gene Set 15 : KT1
===================
Test for OverExpression : 5.203374e-08
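The Wilcoxon-based test described above can be sketched as follows, under the competitive null (set genes vs. the rest). This is a rough sketch, not the code that produced the output above: the function names and t-scores are hypothetical, and the p-values use a plain normal approximation with no tie or continuity correction, so an established implementation is preferable for real analyses.

```python
from math import erfc, sqrt

def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def wilcoxon_set_test(scores, in_set):
    """Rank-sum test of the set genes against the remaining genes.

    Returns (p_under, p_over): one-sided p-values for the set having lower
    or higher scores than the rest (normal approximation).
    """
    ranks = midranks(scores)
    n1 = sum(in_set)
    n2 = len(scores) - n1
    rank_sum = sum(r for r, member in zip(ranks, in_set) if member)
    u = rank_sum - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p_over = 0.5 * erfc(z / sqrt(2))    # P(Z >= z)
    p_under = 0.5 * erfc(-z / sqrt(2))  # P(Z <= z)
    return p_under, p_over

# Hypothetical t-scores for 12 genes; the first four form the gene set.
t_scores = [-3.1, -2.4, -2.9, -1.8, 0.3, -0.5, 1.2, 0.8, -0.1, 1.9, 0.4, -1.0]
in_set = [True] * 4 + [False] * 8

p_under, p_over = wilcoxon_set_test(t_scores, in_set)
print(f"p (set scores lower than the rest):  {p_under:.4g}")
print(f"p (set scores higher than the rest): {p_over:.4g}")
```

In this toy input the four set genes have the four lowest scores, so the test for lower scores gives a small p-value and the test for higher scores a large one.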
Options for GSEA
There are many options available. The hardest part is choosing among them.
Roast Method
Included in limma.
Efron & Tibshirani: GSA
Included in SAM.
Available for different experimental layouts: Paired, Continuous, Survival
Broad Institute’s:
Classical GSEA
Bioconductor: PAGE
Combines results with network visualization
Suppose we have several data types
Image
Two possibilities for integration
First integrate-then-GSEA (Integrative Approach or Stage I integration)
Compute gene-to-phenotype association scores using all available data types (say, using logistic regression or other linear model)
First GSA-then-integrate (Meta-Analytic approach or Stage II integration)
Use, say, Wilcoxon p-values and take their geometric average, or take the smallest one across all data types (some consensus measurement)
Stage I integration in detail
Heterogeneous data are integrated into a single gene-specific score \(s_g(X^1, X^2, ..., X^D, Y)\) that draws on all the measurements available for gene \(g\) across all the dimensions studied.
It is followed by one-dimensional GSA. For example, the score can be based on a regression model \[
\phi(E(Y_i|X_{gi}^1,...,X_{gi}^D)) = \sum_{d\in\{1,...,D\}} X_{gi}^d \beta_g^d
\] where \(\phi\) is a link function and \(i\) indexes the biological sample.
For each gene, the Stage I score can be provided by a measure of the overall fit of the model, say, a likelihood ratio for comparing this model to the “null” model in which all the \(\beta_g^d\) coefficients are zero.
In Stage II these scores can then be analyzed using traditional methods, finally giving set-specific scores \(t_s(s, M_s)\).
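As a worked form of the likelihood-ratio score mentioned above, and assuming \(\ell_g\) denotes the log-likelihood of the regression model for gene \(g\) and \(\hat\beta_g^d\) its maximum-likelihood estimates (notation introduced here for illustration), one possible choice is \[
s_g = 2\left[\ell_g(\hat\beta_g^1,...,\hat\beta_g^D) - \ell_g(0,...,0)\right],
\] which, under the null model and the usual regularity conditions, is approximately chi-squared distributed with \(D\) degrees of freedom, so larger values of \(s_g\) indicate a stronger association between gene \(g\) and the phenotype.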
Stage II integration: GSA + Integration
This approach starts as a standard one-dimensional GSA:
we determine gene-to-phenotype association scores \(s_g^d(X^d, Y)\) separately for each dimension,
and, in Stage II, we compute set-specific scores \(t_s^d(s, M_s), d \in \{1, ..., D\}\), for each dimension.
Next, these scores (e.g. p-values) can be integrated, say, by averaging: \[
t_s(s, M_s) = \operatorname{avg}_{d\in \{1,...,D\}} t_s^d(s, M_s),
\] when evidence of significance from several data types is needed, or by taking the extremum score: \[
t_s(s, M_s) = \max_{d\in \{1,...,D\}} t_s^d(s, M_s),
\] when strong evidence from a single dimension seems to be sufficient (for p-values, where smaller means stronger evidence, the extremum is the minimum).
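The two ways of combining set-level scores can be sketched as follows. This is a minimal illustration: the set names, data types, and p-values are invented, the geometric mean stands in for the averaging operator, and the combined values are best read as scores for ranking sets rather than as calibrated p-values.

```python
from math import exp, log

def geometric_mean(p_values):
    """Geometric mean, computed on the log scale."""
    return exp(sum(log(p) for p in p_values) / len(p_values))

# Hypothetical set-level p-values, one per data type (dimension).
set_p_values = {
    "set1": {"expression": 1e-4, "copy_number": 0.03, "methylation": 0.20},
    "set2": {"expression": 0.04, "copy_number": 0.05, "methylation": 0.03},
    "set3": {"expression": 0.60, "copy_number": 1e-5, "methylation": 0.70},
}

# Averaging rewards sets supported by several data types;
# the smallest p-value rewards strong evidence from a single data type.
by_average = {s: geometric_mean(list(p.values())) for s, p in set_p_values.items()}
by_minimum = {s: min(p.values()) for s, p in set_p_values.items()}

for name in set_p_values:
    print(f"{name}: geometric mean = {by_average[name]:.3g}, "
          f"smallest = {by_minimum[name]:.3g}")
```

With these made-up numbers the two rules rank the sets differently: averaging favours set1, which has some support in more than one data type, while the minimum favours set3, which is driven by a single data type.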
Case studies and examples
The R/Bioconductor package RTopper implements the different approaches described